Data fusion - Resolving Data Conflicts for Integration
Abstract
The amount of information produced in the world increases by 30% every year, and this rate will only go up. With advanced network technology, more and more sources are available, either over the Internet or in enterprise intranets. Modern data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have driven fruitful research on data integration over the last two decades [13, 15]. Data integration systems face two kinds of challenges. First, data from disparate sources are often heterogeneous. Heterogeneity can exist at the schema level, where different data sources often describe the same domain using different schemas; it can also exist at the instance level, where different sources can represent the same real-world entity in different ways. There is a rich body of work on resolving heterogeneity in data, including, at the schema level, schema mapping and matching [17], model management [1], answering queries using views [14], and data exchange [10], and, at the instance level, record linkage (also known as entity resolution, object matching, or reference linkage) [9, 18] and string similarity comparison [6]. Second, different sources can provide conflicting data. Conflicts can arise because of incomplete, erroneous, or out-of-date data. Returning incorrect data in a query result can be misleading and even harmful: one may contact a person at an out-of-date phone number, visit a clinic at the wrong address, hold wrong beliefs about the real world, or even make poor business decisions. It is thus critical for data integration systems to resolve conflicts among sources and to distinguish true values from false ones.
This problem becomes especially prominent with the ease of publishing and spreading false information on the Web and has recently received increasing attention. This tutorial focuses on data fusion, which addresses the second challenge by fusing records on the same real-world
Similar resources
A Relational Operator Approach to Data Fusion
Integrated information systems provide users with a single unified view to heterogeneous data sources. As the resolution of schema level conflicts and the detection of fuzzy duplicates has been looked at more comprehensively, the problem of resolving data level conflicts still remains. We propose a relational data fusion operator, which fuses tuples representing the same real world entity by re...
Eliminating NULLs with Subsumption and Complementation
In a data integration process, an important step after schema matching and duplicate detection is data fusion. It is concerned with the combination or merging of different representations of one real-world object into a single, consistent representation. In order to solve potential data conflicts, many different conflict resolution strategies can be applied. In particular, some representations ...
A Metadata Approach to Resolving Semantic Conflicts (Working Paper, Alfred P. Sloan School of Management)
Semantic reconciliation is an important step in determining logical connectivity between a data source (database) and a data receiver (application). Semantic reconciliation is used to determine if the semantics of the data provided by the source is meaningful to the receiver. In this paper we describe a rule-based approach to semantic specification and demonstrate how this specification can b...
Resolving Structural Conflicts in the Integration of XML Schemas: A Semantic Approach
While the Internet has facilitated access to information sources, the task of scalable integration of these heterogeneous data sources remains a challenge. The adoption of the eXtensible Markup Language (XML) as the standard for data representation and exchange has led to an increasing number of XML data sources, both native and non-native. Recent integration work has mainly focused on developi...
Preprocessing and Integration of Data from Multiple Sources for Knowledge Discovery
The explosive growth in the generation and collection of data has generated an urgent need for a new generation of techniques and tools that can assist in transforming these data intelligently and automatically into useful knowledge. Knowledge discovery is an emerging multidisciplinary field that attempts to fulfill this need. Knowledge discovery is a large process that includes data selection,...
Journal: PVLDB
Volume: 2
Publication date: 2009